upi:zfan253
id:556588804
The data set has 5 numeric predictors and one response variable, Class, which has two levels: Nasal (nasal vowels) and Oral (oral vowels). The numeric variables have all been standardised (and then rounded to 4 d.p.) to have mean 0 and standard deviation 1, as in the original data set.
phoneme = read.csv("phoneme.csv", stringsAsFactors=TRUE)
To perform the following tasks, we need to load the following library:
library(mclust)
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
with(phoneme, pairs(phoneme[,-6], col=c(2,4)[Class], pch=c(1,3)[Class]))
We can sort the data frame by Class, so that the majority class is drawn first.
phoneme <- phoneme[order(phoneme$Class),]
with(phoneme, pairs(phoneme[,-6], col=c(2,4)[Class], pch=c(1,3)[Class]))
attach(phoneme)
hist(V1, freq=FALSE, breaks=100,
     col="gray80", border="white",
     main="Density Estimation", xlab="V1")
methods <- c("nrd0", "ucv", "bcv", "SJ")
cols <- seq_along(methods) + 10   # one colour per bandwidth method
for (bw in methods) {
  lines(density(V1, bw=bw), col=cols[match(bw, methods)], lwd=2)
}
legend("topright", legend=methods, col=cols, lwd=2)
(r = densityMclust(V1, G=1:20))
## 'densityMclust' model object: (V,6)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
## [17] "density"
plot(r, what="BIC")
Note that the mclust package computes BIC with the opposite sign to the usual convention: larger values are better, so we would negate its BIC before minimising. According to BIC, comparing the equal-variance (E) and varying-variance (V) families, the best model is the varying-variance model with 6 components.
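As a quick check of the sign convention (a sketch using the r object fitted above; summary on the BIC component is part of mclust):
summary(r$BIC)  # lists the top models; mclust's BIC is on a "larger is better" scale
# -r$BIC would put it on the usual "smaller is better" scale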
r2 = densityMclust(V1, G=6,modelNames="V")
plot(r2, V1, "density", breaks=100, lwd=2, col=4)
The right tail of this fit is overly smooth and does not track the histogram as closely as the best of the KDEs (ucv or SJ) did, but around the modes (the peaks in the middle) the normal mixture fits better. Overall, the normal mixture is the better estimate.
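To see this comparison directly, one can overlay the best KDE on the mixture density (a sketch using the objects defined above):
plot(r2, V1, what="density", breaks=100, lwd=2, col=4)  # (V,6) mixture density
lines(density(V1, bw="SJ"), col=2, lwd=2)               # Sheather-Jones KDE
legend("topright", legend=c("mixture (V,6)", "KDE (SJ)"), col=c(4,2), lwd=2)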
r_e_c1 = densityMclust(phoneme[Class=='Nasal',1:5],G=1:9,modelNames="EEE")
plot(r_e_c1,what="density", col=4, points.col="grey")
r_e_c2 = densityMclust(phoneme[Class=='Oral',1:5],G=1:9,modelNames="EEE")
plot(r_e_c2,what="density", col=4, points.col="grey")
r_v_c1 = densityMclust(phoneme[Class=='Nasal',1:5],G=1:9,modelNames="VVV")
plot(r_v_c1,what="density",col=2, points.col="grey")
r_v_c2 = densityMclust(phoneme[Class=='Oral',1:5],G=1:9,modelNames="VVV")
plot(r_v_c2,what="density",col=2, points.col="grey")
We can build a classifier based on Bayes' rule: the class-conditional density estimates P(X | Y) serve as the likelihoods, and the sample proportions serve as the prior probabilities.
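Concretely, with prior probabilities p1 = P(Y = Nasal) and p2 = P(Y = Oral), and class-conditional density estimates f1(x) and f2(x), Bayes' rule gives

P(Y = Nasal | x) = p1 f1(x) / (p1 f1(x) + p2 f2(x)),

and we predict Nasal whenever this posterior is at least 0.5.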
First, we compute the resubstitution classification accuracy of the EEE model.
p1 <- mean(Class=='Nasal')
p2 <- 1-p1
f_1 <- predict(r_e_c1,phoneme[,1:5])
f_2 <- predict(r_e_c2,phoneme[,1:5])
p <- f_1*p1/(f_1*p1 + f_2*p2)  # posterior probability of Nasal
p <- (p < 0.5) + 1             # predicted class: 1 = Nasal, 2 = Oral
mean(p==as.numeric(Class))
## [1] 0.82
Then, we compute the resubstitution classification accuracy of the VVV model.
p1 <- mean(Class=='Nasal')
p2 <- 1-p1
f_1 <- predict(r_v_c1,phoneme[,1:5])
f_2 <- predict(r_v_c2,phoneme[,1:5])
a <- f_1*p1/(f_1*p1 + f_2*p2)  # posterior probability of Nasal
p <- (a < 0.5) + 1             # predicted class: 1 = Nasal, 2 = Oral
mean(p==as.numeric(Class))
## [1] 0.835
First we standardize the data before using k-means.
phoneme2 <- phoneme
phoneme2[,1:5] <- scale(phoneme[,1:5])
r0 = kmeans(phoneme2[,-6], centers=2)
We reorder the data frame by cluster label, reordering the labels along with the rows so the colours stay matched to the points.
ord <- order(r0$cluster)
phoneme2 <- phoneme2[ord,]
cluster <- r0$cluster[ord]
pairs(phoneme2[,1:5], col=cluster+1, pch=cluster+1, main="K = 2 (Original Data)")
ari = double(8)
for(k in 2:9) {
r = kmeans(phoneme2[,1:5], centers=k)
ari[k-1] = adjustedRandIndex(phoneme2$Class, r$cluster)
}
ari
## [1] 0.08355201 0.11711319 0.06678252 0.09069357 0.05706456 0.04566082 0.06082125
## [8] 0.03767955
The ARI values for k = 2, ..., 9 are all low, close to what a random labelling would give. Hence k-means recovers very little of the class structure in this data set.
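For calibration, the ARI of a completely random labelling has expectation 0 (a quick sketch using mclust's adjustedRandIndex; the seed is arbitrary):
set.seed(1)
adjustedRandIndex(sample(1:2, 200, replace=TRUE), sample(1:2, 200, replace=TRUE))
# typically a value near 0, so the k-means ARIs above are barely better than chance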
(r = densityMclust(phoneme[,1:5], G=2, modelNames="VVV"))
## 'densityMclust' model object: (VVV,2)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
## [17] "density"
We relabel so that the majority cluster is 1 and the minority cluster is 2, then reorder the data frame by cluster label (reordering labels with rows so colours stay matched).
classification <- r$classification %% 2 + 1
ord <- order(classification)
phoneme <- phoneme[ord,]
classification <- classification[ord]
pairs(phoneme[,1:5], col=classification+1, pch=classification+1, main="G = 2 (Original Data)")
ari = double(8)
# ARI is invariant to permutations of the cluster labels, so no relabelling is needed.
for(g in 2:9) {
  r = densityMclust(phoneme[,1:5], G=g, modelNames="VVV")
  ari[g-1] = adjustedRandIndex(phoneme$Class, r$classification)
}
ari
The ARI values for mixture-based clustering are also low.
d <- dist(phoneme2[,-6])
r0 = hclust(d)
cex=1
plot(r0, cex.axis=1, cex.lab=cex, cex.main=cex)
r1 = hclust(d,method = 'single')
plot(r1, cex.axis=1, cex.lab=cex, cex.main=cex)
“Single-linkage (nearest neighbor) is the shortest distance between a pair of observations in two clusters. It can sometimes produce clusters where observations in different clusters are closer together than to observations within their own clusters. These clusters can appear spread-out.”
“Complete-linkage (farthest neighbor) is where distance is measured between the farthest pair of observations in two clusters. This method usually produces tighter clusters than single-linkage, but these tight clusters can end up very close together. Along with average-linkage, it is one of the more popular distance metrics.”
(Source: Introduction to Hierarchical Clustering, https://towardsdatascience.com/introduction-hierarchical-clustering-d3066c6b560e0)
The two plots look very different because they use contrasting between-cluster distances: single linkage uses the closest pair of observations between two clusters, while complete linkage uses the farthest pair. Hence single linkage tends to produce chained, spread-out clusters, whereas complete linkage produces tighter, more compact clusters.
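As a small illustration of the two merge rules on toy one-dimensional data (a sketch, not the phoneme data):
x <- c(1, 2, 3, 10, 11, 12)
d_toy <- dist(x)
hclust(d_toy, method="single")$height    # merge heights driven by nearest pairs: 1, 1, 1, 1, 7
hclust(d_toy, method="complete")$height  # final merge uses the farthest pair: 12 - 1 = 11
The single-linkage heights stay small until the last merge, while complete linkage's final merge height is the diameter of the combined cluster, which is why its dendrogram branches look more compact.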
Complete method
r = hclust(d)
ari = double(8)
for(k in 2:9) {
  cluster <- cutree(r, k)
  # d was computed from phoneme2, so compare against phoneme2$Class
  ari[k-1] = adjustedRandIndex(phoneme2$Class, cluster)
}
ari
Single method
r = hclust(d, method='single')
ari = double(8)
for(k in 2:9) {
  cluster <- cutree(r, k)
  ari[k-1] = adjustedRandIndex(phoneme2$Class, cluster)
}
ari
Complete method
heatmap(as.matrix(phoneme[,-6]), scale="column", distfun=dist, hclustfun=hclust,
        margins=c(15,5), main='Heatmap for complete method')
heatmap(as.matrix(phoneme[,-6]), scale="column", distfun=dist,
        hclustfun=function(x) hclust(x, method='single'),
        margins=c(15,5), main='Heatmap for single method')
In this lab, we explored density estimation and clustering. First we compared kernel density estimates with univariate normal mixture density estimation. Then we tried multivariate density estimation with the equal-covariance (EEE) and varying-covariance (VVV) subfamilies for each of the two levels of the target variable. Having density estimates for X | Y, we could naturally build a classifier via Bayes' rule. Then we moved to clustering: k-means, mixture-based clustering, and hierarchical clustering. For each clustering method we computed adjusted Rand indices, and all were low, which suggests that unsupervised clustering does not recover the class structure of this data set. Last, we compared the single and complete linkage methods for hierarchical clustering using dendrograms and heatmaps.
Introduction to Hierarchical Clustering